Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

Tan, Fuwen, Cascante-Bonilla, Paola, Guo, Xiaoxiao, Wu, Hui, Feng, Song, Ordonez, Vicente

Neural Information Processing Systems

This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient, compact state representation that significantly extends current methods for single-round image retrieval. We show that using multiple rounds of natural language queries as input can be surprisingly effective for finding arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions can provide a surprisingly effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval in a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.
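To make the idea of a compact multi-round query state concrete, here is a minimal, hypothetical sketch (not the paper's actual model): each incoming query fills one of a fixed number of state slots; once the slots are full, a new query merges into the most similar slot; and an image is scored by matching its region vectors against the occupied slots. The toy text encoder, the slot-update rule, and the scoring function are all illustrative assumptions.

```python
import numpy as np

def encode_query(text, dim=8):
    # Hypothetical stand-in for a learned text encoder:
    # hash words into a fixed-size vector and L2-normalize.
    v = np.zeros(dim)
    for w in text.split():
        v[hash(w) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

class QueryState:
    """Compact state: M slots, each holding one aggregated query vector."""
    def __init__(self, num_slots=3, dim=8):
        self.slots = np.zeros((num_slots, dim))
        self.used = np.zeros(num_slots, dtype=bool)

    def update(self, query_vec):
        if not self.used.all():
            # Fill the first empty slot with the new query.
            i = int(np.argmax(~self.used))
            self.slots[i] = query_vec
            self.used[i] = True
        else:
            # All slots occupied: merge into the most similar slot.
            sims = self.slots @ query_vec
            i = int(np.argmax(sims))
            merged = self.slots[i] + query_vec
            self.slots[i] = merged / (np.linalg.norm(merged) + 1e-8)

    def score(self, region_vecs):
        # Image score: for each occupied slot, take its best-matching
        # region, then sum over slots.
        sims = self.slots[self.used] @ region_vecs.T
        return float(sims.max(axis=1).sum())

# Two rounds of user queries refine the state; regions of a candidate
# image are matched against it.
state = QueryState(num_slots=2, dim=8)
state.update(encode_query("a red couch in the corner"))
state.update(encode_query("a lamp on a wooden table"))
regions = np.stack([encode_query("red couch"),
                    encode_query("wooden table lamp")])
print(state.score(regions) > 0.0)  # → True
```

The key property this sketch shares with the described framework is that the state size stays fixed at M slots no matter how many query rounds arrive, so retrieval cost does not grow with the length of the interaction.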


Reviews: Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

Neural Information Processing Systems

The main problem for me is that the paper promises a very realistic scenario (Figure 1) of how a user can refine a search with a sequence of refined queries. However, the majority of the model design and evaluation (except Section 4.2) is performed with dense region captions that have almost no sequential nature. While this is partially a strength, since no additional labels are required, the method seems especially suited to such disconnected queries -- the model reserves space for M disconnected queries, and updates are required only once those slots are filled. An analysis along these lines would provide a deeper understanding of when the proposed method works better. The user queries in Figure 1 seem very natural, but the simulated queries do not.


Reviews: Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

Neural Information Processing Systems

This paper investigates the problem of multi-round natural language image retrieval, using annotations from the Visual Genome dataset for training and evaluation. After feedback and reviewer discussion, this paper received final ratings of 6, 6 and 7. Despite some concerns about the use of non-sequential annotation data for a sequential task, the reviewers found the proposed model to be generally sound and the experimental evaluation convincing, and the AC agrees. However, we would encourage the authors to pay close attention to the reviewer feedback when preparing the final paper version. In particular, the author feedback committed to including the additional baselines requested by R1, so these should be included in the final version as promised.


Balancing Reinforcement Learning Training Experiences in Interactive Information Retrieval

Chen, Limin, Tang, Zhiwen, Yang, Grace Hui

arXiv.org Artificial Intelligence

Interactive Information Retrieval (IIR) and Reinforcement Learning (RL) share many commonalities, including an agent that learns while it interacts, a long-term and complex goal, and an algorithm that explores and adapts. To successfully apply RL methods to IIR, one challenge is to obtain sufficient relevance labels to train the RL agents, which are notoriously sample-inefficient. However, in a text corpus annotated for a given query, it is not the relevant documents but the irrelevant documents that predominate. This yields highly unbalanced training experiences for the agent and prevents it from learning any effective policy. Our paper addresses this issue by using domain randomization to synthesize more relevant documents for training. Our experimental results on the Text REtrieval Conference (TREC) Dynamic Domain (DD) 2017 Track show that the proposed method boosts an RL agent's learning effectiveness by 22% in dealing with unseen situations.
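The balancing idea above can be sketched as follows. This is an illustrative toy, not the paper's method: pseudo-relevant documents are synthesized by randomly dropping and shuffling words of existing relevant ones until the relevant pool matches the irrelevant count; the function name and perturbation scheme are assumptions.

```python
import random

def synthesize_relevant(docs, target_count, drop_prob=0.2, seed=0):
    """Domain-randomization-style augmentation (illustrative only):
    grow the relevant pool to target_count by perturbing existing
    relevant documents with random word dropout and shuffling."""
    rng = random.Random(seed)
    pool = list(docs)
    while len(pool) < target_count:
        base = rng.choice(docs).split()
        # Drop each word with probability drop_prob; keep the
        # original if everything was dropped.
        kept = [w for w in base if rng.random() > drop_prob] or base
        rng.shuffle(kept)
        pool.append(" ".join(kept))
    return pool

relevant = ["solar power subsidies in rural areas",
            "wind farm permits and grid access"]
irrelevant_count = 10  # many more irrelevant docs in the corpus
balanced = synthesize_relevant(relevant, irrelevant_count)
print(len(balanced))  # → 10
```

With the pools balanced, each training episode is roughly as likely to surface a relevant document as an irrelevant one, which is the property the sample-inefficient RL agent needs.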

